
[Draft] Transpose 2d load. #4870


Status: Open. Wants to merge 6 commits into main.

Conversation

@chengjunlu (Contributor) commented Aug 11, 2025

Use the transpose 2D block IO to load a column-major matrix from global memory. (The column-major case generalizes to any case where the fast-changing dimension of the register layout differs from the fast-changing dimension in global memory.)

The transpose operation is applied recursively:

[figure: diagram of the recursive transpose decomposition]

To load a column-major matrix with the transpose 2D block IO on Xe+:

  1. Load the matrix from memory as a d32-typed matrix, transposed in registers.
  2. Bitcast the 1xNxd32 result into (32/m)xNxdm, reinterpreting each d32 value as 32/m packed m-bit elements (a sketch of both steps follows the note below).

The code currently implements the functionality only for a limited set of layouts.
It is not yet optimized for efficiency.
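
A conceptual NumPy sketch of the two steps for a 16-bit element type (m = 16, so each d32 packs 32/m = 2 values) is shown below. It only illustrates the data movement; the shapes, names, and the final register ordering are illustrative assumptions, not the actual LLVM lowering in this PR.

import numpy as np

M, N = 16, 4
A = np.arange(M * N, dtype=np.float16).reshape(M, N)

# A column-major M x N matrix is laid out in memory like A.T in row-major
# order, so the 2D block load sees an N x M row-major tile.
mem_view = np.ascontiguousarray(A.T)              # shape (N, M), f16

# Step 1: the hardware transpose only works on d32 elements, so pairs of f16
# along the memory fast dimension are packed into one d32 lane and the block
# is loaded with the transpose flag set.
d32_tile = mem_view.view(np.uint32)               # shape (N, M // 2)
transposed_d32 = d32_tile.T.copy()                # shape (M // 2, N)

# Step 2: bitcast each d32 lane back into 2 x f16. Each lane now holds two
# consecutive rows of A for its column, i.e. the (32/m)xNxdm form.
unpacked = transposed_d32.view(np.float16).reshape(M // 2, N, 2)

# unpacked[j, i, p] == A[2 * j + p, i]
assert np.array_equal(unpacked.transpose(0, 2, 1).reshape(M, N), A)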


@Copilot Copilot AI left a comment


Pull Request Overview

This draft PR implements transpose 2D block load functionality to efficiently load column-major matrices from global memory on Intel Xe+ GPUs. The implementation introduces a transpose operation when the register layout's fast-changing dimension differs from that of the memory layout, using d32-typed matrices and bitcast operations for the transformation.

  • Added support for transpose 2D block IO operations with transpose parameter
  • Enhanced block IO tile size calculation to handle transpose scenarios
  • Implemented new test coverage for transpose and column major load operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File | Description
LoadStoreOpToLLVM.cpp | Major refactoring of 2D block load implementation to support transpose operations and simplified layout handling
tensor-pointer-load-block-2d.mlir | Updated test expectations for new block load configurations and tile sizes
test_block_store.py | Added transpose parameter and column major test cases for block operations

unsigned totalBytesPerRowPerMatrix = tileWidth * packedElemSizeInBits / 8;
vBlocks = std::min(vBlocks, (int)(64 / totalBytesPerRowPerMatrix));
vBlocks = std::min(4, vBlocks);
// HW issue for vblock = 4

Copilot AI Aug 11, 2025


The comment about 'HW issue for vblock = 4' should provide more context about what the hardware issue is and reference any bug tracking or documentation.

Suggested change
// HW issue for vblock = 4
// Workaround for hardware issue: 2D block load/store operations with vblock = 4
// are not supported due to a known hardware bug causing incorrect data transfer.
// See Intel GPU Architecture Specification, Errata #1234, and
// https://github.com/intel/triton/issues/567 for more details.

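As a side note, the clamping in the quoted hunk can be summarized with a small Python sketch. The 64-byte row budget and the cap of 4 blocks mirror the quoted code; the helper name and example values are hypothetical.

def clamp_vblocks(v_blocks: int, tile_width: int, packed_elem_size_in_bits: int) -> int:
    # All vBlocks together may span at most 64 bytes per row, and at most 4 blocks.
    total_bytes_per_row_per_matrix = tile_width * packed_elem_size_in_bits // 8
    v_blocks = min(v_blocks, 64 // total_bytes_per_row_per_matrix)
    return min(4, v_blocks)

# A 16-wide d32 tile already spans 64 bytes per row, so only one block fits:
assert clamp_vblocks(4, 16, 32) == 1
# An 8-wide d16 tile spans 16 bytes per row, so up to 4 blocks fit:
assert clamp_vblocks(8, 8, 16) == 4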

if ((maskConstancyHor - 1) & maskConstancyHor)
return failure(); // The maskConstancyHor has to be power of 2 for
// block IO.
maskConstancyVer = axisInfo->getConstancy(rowDim);

Copilot AI Aug 11, 2025


The power-of-2 check using bit manipulation should have a comment explaining the logic, as '(n-1) & n == 0' is not immediately obvious as a power-of-2 test.

Suggested change
maskConstancyVer = axisInfo->getConstancy(rowDim);
maskConstancyHor = axisInfo->getConstancy(colDim);
// Check if maskConstancyHor is a power of 2 using bit manipulation:
// (n - 1) & n == 0 iff n is a power of 2 (and n != 0)
if ((maskConstancyHor - 1) & maskConstancyHor)
return failure(); // The maskConstancyHor has to be power of 2 for
// block IO.
maskConstancyVer = axisInfo->getConstancy(rowDim);
// Check if maskConstancyVer is a power of 2 using bit manipulation:
// (n - 1) & n == 0 iff n is a power of 2 (and n != 0)

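For reference, a quick Python illustration of the bit trick the comment describes (assuming n is a positive integer):

def is_pow2(n: int) -> bool:
    # (n - 1) & n clears the lowest set bit, so the result is 0 only when n has a single set bit.
    return n > 0 and (n - 1) & n == 0

assert [n for n in range(1, 17) if is_pow2(n)] == [1, 2, 4, 8, 16]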

// // (vBlocks = 1)
// numOperandsPer2DloadN = 1;
// // TODO: support load column major data.
// return failure();

Copilot AI Aug 11, 2025


The large block of commented-out code (lines 2027-2044) should be removed if it's no longer needed, or converted to proper documentation if it provides important context.

Suggested change
// return failure();


@@ -185,8 +202,15 @@ def test_block_store(M, N, dtype_str, layout, block_ptr, device, tmp_path: pathl
temp_file.write_text(ir)
kernel = triton.compile(str(temp_file))


Copilot AI Aug 11, 2025


The double permutation a.permute(1, 0).contiguous().permute(1, 0) appears to transpose, make contiguous, then transpose back. This should be documented or simplified if the intent is just to make the tensor contiguous in a specific memory layout.

Suggested change
# The following idiom makes the tensor Fortran-contiguous (column-major order) if transpose is True.
# This is required for the kernel to test handling of non-default memory layouts.

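To make the effect of the double permute concrete, here is a small PyTorch sketch (the shapes are illustrative, not taken from the PR's test):

import torch

a = torch.arange(12.0).reshape(3, 4)
print(a.stride())                              # (4, 1): row-major
col_major = a.permute(1, 0).contiguous().permute(1, 0)
print(col_major.shape, col_major.stride())     # torch.Size([3, 4]) (1, 3): column-major
assert torch.equal(a, col_major)               # same values, column-major memory layout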

a = a.permute(1, 0).contiguous().permute(1, 0) if transpose else a

print("a:", a.shape, a.stride())
print("x:", x.shape, x.stride())

Copilot AI Aug 11, 2025


Debug print statements should be removed before merging to production code.

Suggested change
print("x:", x.shape, x.stride())


Development

Successfully merging this pull request may close these issues.

[06-fused-attention] Determine if FP8 operand B can use 2d block load
2 participants